Income prediction: Banking

18 July 2019

Problem

A bank offers different kinds of deposits to its clients and wants to classify them in order to recommend a deposit suited to each client's needs.

[Figure. Source: Bank.]

Dataset description

The data has 48,678 observations and 15 variables. Each row represents a person living in California, and the columns record characteristics such as age. The last column, Income, tells us whether the individual earns more than 50,000 dollars per year.

Age  Workclass  Fnlg    Education  Education_Num  Marital_status  Occupation         Relationship
25   Private    226802  11th       7              Never-married   Machine-op-inspct  Own-child
36   Empl-gov   212465  Bachelors  13             Married         Adm-clerical       Husband
25   Private    220931  Bachelors  13             Never-married   Prof-specialty     Not-in-family
22   Private    236427  HS-grad    9              Never-married   Adm-clerical       Own-child

(table continued)

Race   Gender  Capital_Gain  Capital_Loss  Hours_Per_Week  Native_Country  Income
Black  Male    0             0             40              United-States   <=50K
White  Male    0             0             40              United-States   <=50K
White  Male    0             0             43              Peru            <=50K
White  Male    0             0             20              United-States   <=50K

Data preprocessing

  • Missing values

  • Outlier detection

  • Data transformation

Missing values

We remove rows with missing values, since we have enough data.

Variable  Missing values
Age 0
Workclass 2799
Fnlg 0
Education 0
Education_Num 0
Marital_status 0
Occupation 2809
Relationship 0
Race 0
Gender 0
Capital_Gain 0
Capital_Loss 0
Hours_Per_Week 0
Native_Country 857
Income 0
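As a sketch, this row-dropping step could look like the following (a pandas illustration whose toy frame mirrors the counts above; the original analysis was presumably done in R):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset: Workclass and Native_Country
# contain missing values, as in the counts above.
df = pd.DataFrame({
    "Age": [25, 36, 25, 22],
    "Workclass": ["Private", np.nan, "Private", "Private"],
    "Native_Country": ["United-States", "United-States", np.nan, "Peru"],
})

# Drop every row with at least one missing value.
clean = df.dropna()
print(len(clean))  # 2: the two incomplete rows are removed
```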

Outlier detection

We do not remove outliers because they are relevant to the problem.


Data transformation

\[ \text{Benefits} = \text{Capital\_Gain} - \text{Capital\_Loss} \]

Capital_Gain Capital_Loss Benefits
0 1721 -1721
3103 0 3103
3674 0 3674
2174 0 2174
3411 0 3411
0 1721 -1721
2907 0 2907
4386 0 4386
5013 0 5013
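The transformation above can be sketched in pandas (illustrative, not the original code):

```python
import pandas as pd

df = pd.DataFrame({
    "Capital_Gain": [0, 3103, 3674],
    "Capital_Loss": [1721, 0, 0],
})

# Benefits = Capital_Gain - Capital_Loss
df["Benefits"] = df["Capital_Gain"] - df["Capital_Loss"]
print(df["Benefits"].tolist())  # [-1721, 3103, 3674]
```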

Data transformation

We relabel the categories of the variable Marital_status.
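A sketch of the relabeling. The exact grouping is an assumption on our part; the consolidated levels Married, Never-married, Separated and Widowed do appear later in the logistic-regression coefficient table:

```python
import pandas as pd

# Assumed mapping (hypothetical): collapse the detailed married
# categories into a single "Married" level.
relabel = {
    "Married-civ-spouse": "Married",
    "Married-spouse-absent": "Married",
    "Married-AF-spouse": "Married",
    "Never-married": "Never-married",
    "Separated": "Separated",
    "Divorced": "Divorced",
    "Widowed": "Widowed",
}

s = pd.Series(["Married-civ-spouse", "Never-married", "Divorced"])
print(s.map(relabel).tolist())  # ['Married', 'Never-married', 'Divorced']
```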

Data visualization

Questions

  • Does an individual earn more than 50,000 dollars per year?

  • Does the variable Marital_status help us predict Income?

Data visualization

Analysis

Prepare data

We address the class imbalance problem using undersampling.
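Undersampling can be sketched as randomly dropping majority-class rows until the classes are balanced (a minimal pure-Python illustration; the helper `undersample` is ours, not from the original analysis):

```python
import random

def undersample(rows, label_key):
    """Randomly keep an equal number of rows from each class."""
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    n_min = min(len(v) for v in by_class.values())
    balanced = []
    for v in by_class.values():
        balanced.extend(random.sample(v, n_min))
    random.shuffle(balanced)
    return balanced

# Toy imbalanced data: 90 majority vs 10 minority rows.
data = [{"Income": "<=50K"}] * 90 + [{"Income": ">50K"}] * 10
balanced = undersample(data, "Income")
print(len(balanced))  # 20: ten rows per class
```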

Training and testing

We split the data into three sets: train, validation, and test.
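A minimal sketch of a three-way split (the 60/20/20 fractions are an assumption; the deck does not state the proportions):

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then cut into test / validation / train slices."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```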

Methods

Classification techniques:

  • Logistic Regression

  • Linear and quadratic discriminant analysis

  • K Nearest Neighbor

  • Random forest

  • Boosting

  • Support Vector Machines

  • Naive Bayes Classifier

Model selection

We apply stepwise selection using all variables.

\[ \text{Income} \sim . \]
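The stepwise idea can be sketched as a greedy forward search (illustrative only: `score` stands in for refitting the model and measuring accuracy, and the toy gains are invented):

```python
def forward_stepwise(candidates, score, k):
    """Greedily add the variable that most improves the score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each variable has a fixed, additive usefulness.
gains = {"Marital_status": 5, "Education_Num": 4, "Age": 3,
         "Hours_Per_Week": 2, "Benefits": 1, "Race": 0}
score = lambda vars_: sum(gains[v] for v in vars_)
print(forward_stepwise(gains, score, 5))
```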

Model selection

We decide to use 5 variables to train our models, because adding more variables does not significantly increase model accuracy.

\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Age} + \text{Hours\_Per\_Week} + \text{Benefits} \]

Model selection:

[Best-subsets output for model sizes 1 to 5. Candidate predictors: Age, Workclass, Education_Num, Marital_status, Hours_Per_Week, Relationship, Race, Gender and Benefits; the per-size selection marks were lost in the export.]

Logistic Regression

\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Age} + \text{Hours\_Per\_Week} + \text{Benefits} \]

Training confusion matrix and statistics

<=50K >50K
<=50K 6942 1502
>50K 1980 6911
Accuracy 0.7991347
Sensitivity 0.7780767
Specificity 0.8214668
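The reported statistics can be recomputed from the matrix (reading rows as predictions, columns as reference labels, with <=50K as the positive class; this reproduces the figures above):

```python
# Confusion matrix from the slide: rows = prediction, cols = reference.
#              <=50K   >50K
# pred <=50K    6942   1502
# pred >50K     1980   6911
tp, fn = 6942, 1980          # reference <=50K
fp, tn = 1502, 6911          # reference >50K

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # recall on the "<=50K" class
specificity = tn / (tn + fp)
print(round(accuracy, 7), round(sensitivity, 7), round(specificity, 7))
# 0.7991347 0.7780767 0.8214668
```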

Logistic Regression

Coefficients and significance of variables

Variable  Estimate  Std. Error  z value  Pr(>|z|)  Significant
(Intercept) -8.4274 0.1798 -46.8794 0.0000 TRUE
Education_Num 0.3828 0.0094 40.8500 0.0000 TRUE
Age 0.0300 0.0019 15.7987 0.0000 TRUE
Marital_statusMarried 2.2580 0.0678 33.2836 0.0000 TRUE
Marital_statusNever-married -0.3662 0.0854 -4.2907 0.0000 TRUE
Marital_statusSeparated -0.0815 0.1607 -0.5070 0.6121 FALSE
Marital_statusWidowed -0.1260 0.1591 -0.7917 0.4285 FALSE
Hours_Per_Week 0.0365 0.0020 18.5478 0.0000 TRUE
Benefits 0.0002 0.0000 21.9089 0.0000 TRUE

Logistic Regression

Testing confusion matrix and statistics

<=50K >50K
<=50K 2299 479
>50K 616 2316
Accuracy 0.8082312
Sensitivity 0.7886792
Specificity 0.8286225

Logistic Regression

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.8007005 0.7008576 0.9048301 0.4000000
0.8043783 0.7413379 0.8701252 0.4500000
0.8070053 0.7835334 0.8314848 0.4941281
0.8082312 0.7886792 0.8286225 0.5000000
0.8010508 0.8102916 0.7914132 0.5300000
0.7945709 0.8370497 0.7502683 0.5700000
0.7781086 0.8679245 0.6844365 0.6200000
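A table like the one above can be produced by re-thresholding the predicted probabilities (a pure-Python sketch with toy probabilities; the helper `metrics_at_threshold` is ours):

```python
def metrics_at_threshold(probs, labels, threshold, positive="<=50K"):
    """Classify as positive when P(positive) >= threshold, then score."""
    tp = fn = fp = tn = 0
    for p, y in zip(probs, labels):
        pred = positive if p >= threshold else "other"
        if y == positive:
            tp += pred == positive
            fn += pred != positive
        else:
            fp += pred == positive
            tn += pred != positive
    acc = (tp + tn) / len(labels)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return acc, sens, spec

# Toy predicted probabilities of the positive class.
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = ["<=50K", "<=50K", "<=50K", "other", "other", "other"]
for t in (0.4, 0.5, 0.7):
    print(t, metrics_at_threshold(probs, labels, t))
```

Raising the threshold trades sensitivity for specificity, which is exactly the pattern in the table above.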

Logistic Regression

Comparison of models

num_variables  Resid. Df  Resid. Dev
1 17330 19065.23
2 17329 16400.06
3 17328 16117.06
4 17327 15692.77
5 17326 14771.59
6 17313 14227.68
7 17320 14504.95
8 17307 13978.14
9 17307 13978.14

Discriminant analysis

Linear discriminant analysis

Training confusion matrix and statistics

<=50K >50K
<=50K 6475 1226
>50K 2447 7187
Accuracy 0.7881165
Sensitivity 0.7257341
Specificity 0.8542731

Linear discriminant analysis

Testing confusion matrix and statistics

<=50K >50K
<=50K 2114 377
>50K 801 2418
Accuracy 0.7936953
Sensitivity 0.7252144
Specificity 0.8651163

Linear discriminant analysis

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.7879159 0.6826758 0.8976744 0.4000000
0.7894921 0.7008576 0.8819320 0.4500000
0.7936953 0.7252144 0.8651163 0.5000000
0.7966725 0.7488851 0.8465116 0.5375239
0.8000000 0.7825043 0.8182469 0.5700000
0.7898424 0.8202401 0.7581395 0.6200000

Quadratic discriminant analysis

Training confusion matrix and statistics

<=50K >50K
<=50K 6880 1374
>50K 2042 7039
Accuracy 0.8029420
Sensitivity 0.7711275
Specificity 0.8366813

Quadratic discriminant analysis

Testing confusion matrix and statistics

<=50K >50K
<=50K 2269 454
>50K 646 2341
Accuracy 0.8073555
Sensitivity 0.7783877
Specificity 0.8375671

Quadratic discriminant analysis

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.7959720 0.7200686 0.8751342 0.4000000
0.8015762 0.7440823 0.8615385 0.4546504
0.8073555 0.7783877 0.8375671 0.5000000
0.8078809 0.7927959 0.8236136 0.5200000
0.7996497 0.8373928 0.7602862 0.5700000
0.7831874 0.8809605 0.6812165 0.6200000

K Nearest Neighbor

Training confusion matrix and statistics

<=50K >50K
<=50K 6909 1127
>50K 2013 7286
Accuracy 0.8188636
Sensitivity 0.7743779
Specificity 0.8660407

K Nearest Neighbor

Testing confusion matrix and statistics

<=50K >50K
<=50K 2170 425
>50K 745 2370
Accuracy 0.7950963
Sensitivity 0.7444254
Specificity 0.8479428

K Nearest Neighbor

Metrics for different numbers of neighbors

Accuracy Sensitivity Specificity Neighbors
0.7866900 0.7777015 0.7960644 1
0.7903678 0.7927959 0.7878354 2
0.7938704 0.7687822 0.8200358 3
0.7919440 0.7698113 0.8150268 4
0.7907180 0.7516295 0.8314848 5
0.7924694 0.7567753 0.8296959 6
0.7926445 0.7478559 0.8393560 7
0.7912434 0.7499142 0.8343470 8
0.7915937 0.7437393 0.8415027 9
0.7928196 0.7481990 0.8393560 10
0.7950963 0.7444254 0.8479428 11
0.7945709 0.7475129 0.8436494 12
0.7947461 0.7409949 0.8508050 13
0.7935201 0.7447684 0.8443649 14
0.7917688 0.7413379 0.8443649 15
0.7924694 0.7440823 0.8429338 16
0.7908932 0.7399657 0.8440072 17
0.7903678 0.7399657 0.8429338 18
0.7903678 0.7385935 0.8443649 19
0.7894921 0.7396226 0.8415027 20
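For reference, the k-nearest-neighbor vote itself is simple to state (a toy pure-Python sketch on a single feature; the original analysis presumably used an R package):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D data: low hours -> "<=50K", high hours -> ">50K".
train = [((20,), "<=50K"), ((25,), "<=50K"), ((30,), "<=50K"),
         ((50,), ">50K"), ((55,), ">50K"), ((60,), ">50K")]
print(knn_predict(train, (58,), k=3))  # '>50K'
```

Sweeping k as in the table above just repeats this prediction for k = 1..20 and scores each run on the test set.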

K Nearest Neighbor

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.7851138 0.6627787 0.9127013 0.4000000
0.7938704 0.6974271 0.8944544 0.4500000
0.7970228 0.7276158 0.8694097 0.4902072
0.7950963 0.7444254 0.8479428 0.5000000
0.7905429 0.7807890 0.8007156 0.5500000
0.7851138 0.8312178 0.7370304 0.6200000

Random forest

Training confusion matrix and statistics

<=50K >50K
<=50K 6890 1315
>50K 2022 7095
Accuracy 0.8073548
Sensitivity 0.7731149
Specificity 0.8436385

Random forest

Testing confusion matrix and statistics

<=50K >50K
<=50K 2304 380
>50K 611 2415
Accuracy 0.8264448
Sensitivity 0.7903945
Specificity 0.8640429

Random forest

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.8239930 0.7554031 0.8955277 0.40000
0.8246935 0.7749571 0.8765653 0.45000
0.8271454 0.7962264 0.8593918 0.50000
0.8271454 0.7962264 0.8593918 0.53125
0.8271454 0.7962264 0.8593918 0.55000
0.8169877 0.8267581 0.8067979 0.65000


Boosting

Training confusion matrix and statistics

<=50K >50K
<=50K 7133 1408
>50K 1789 7005
Accuracy 0.8155754
Sensitivity 0.7994844
Specificity 0.8326400

Boosting

Testing confusion matrix and statistics

<=50K >50K
<=50K 2313 398
>50K 602 2397
Accuracy 0.8248687
Sensitivity 0.7934820
Specificity 0.8576029

Boosting

Optimal number of boosting iterations

Boosting

Metrics at different classification thresholds

Accuracy Sensitivity Specificity threshold
0.8112084 0.6939966 0.9334526 0.4000000
0.8199650 0.7396226 0.9037567 0.4500000
0.8278459 0.7886792 0.8686941 0.4912833
0.8248687 0.7934820 0.8576029 0.5000000
0.7994746 0.8888508 0.7062612 0.5300000
0.7530648 0.9416810 0.5563506 0.5700000
0.6604203 0.9938250 0.3127013 0.6200000

Support Vector Machines

Training confusion matrix and statistics

<=50K >50K
<=50K 6221 1061
>50K 2752 7375
Accuracy 0.7809754
Sensitivity 0.6933021
Specificity 0.8742295

Support Vector Machines

Testing confusion matrix and statistics

<=50K >50K
<=50K 2047 343
>50K 912 2429
Accuracy 0.7810155
Sensitivity 0.6917878
Specificity 0.8762626


Support Vector Machines

Tuning the cost parameter

cost error dispersion
0.001 0.2586014 0.0109569
0.010 0.2441260 0.0089190
0.100 0.2321784 0.0087825
1.000 0.2213217 0.0097484
5.000 0.2197711 0.0088204
10.000 0.2197137 0.0086724
25.000 0.2190245 0.0099011

Naive Bayes classifier

Training confusion matrix and statistics

<=50K >50K
<=50K 8645 339
>50K 5651 2803
Accuracy 0.7792178
Sensitivity 0.7265138
Specificity 0.8352259

Naive Bayes

Testing confusion matrix and statistics

<=50K >50K
<=50K 2117 453
>50K 828 2301
Accuracy 0.7752237
Sensitivity 0.7188455
Specificity 0.8355120


Naive Bayes

usekernel fL adjust Accuracy Kappa AccuracySD KappaSD
TRUE 0 2 0.7778983 0.5570333 0.0064239 0.0128679
TRUE 1 2 0.7778983 0.5570333 0.0064239 0.0128679
TRUE 2 2 0.7778983 0.5570333 0.0064239 0.0128679
TRUE 3 2 0.7778983 0.5570333 0.0064239 0.0128679
TRUE 4 2 0.7778983 0.5570333 0.0064239 0.0128679
TRUE 5 2 0.7778983 0.5570333 0.0064239 0.0128679
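In the tuning grid above, fL is a Laplace smoothing count. A minimal categorical naive Bayes shows where it enters (a pure-Python sketch of the smoothed-count variant, not the kernel-density version tuned above):

```python
import math
from collections import Counter, defaultdict

def nb_train(rows, labels, fL=1):
    """Categorical naive Bayes with Laplace smoothing count fL."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature_index) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(y, j)][v] += 1

    def predict(row):
        def log_posterior(y):
            s = math.log(class_counts[y] / len(labels))
            for j, v in enumerate(row):
                c = value_counts[(y, j)]
                # fL smooths away zero probabilities for unseen values.
                s += math.log((c[v] + fL) / (sum(c.values()) + fL * (len(c) + 1)))
            return s
        return max(class_counts, key=log_posterior)

    return predict

rows = [("Married",), ("Married",), ("Never-married",), ("Never-married",)]
labels = [">50K", ">50K", "<=50K", "<=50K"]
predict = nb_train(rows, labels)
print(predict(("Married",)))  # '>50K'
```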


Comparison of methods

ROC curves
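A ROC curve plots the true-positive rate against the false-positive rate as the classification threshold varies. A minimal sketch (toy scores; the helper `roc_points` is ours):

```python
def roc_points(probs, labels, positive=">50K"):
    """(FPR, TPR) at each threshold, for plotting a ROC curve."""
    thresholds = sorted(set(probs), reverse=True)
    pos = sum(y == positive for y in labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(p >= t and y == positive for p, y in zip(probs, labels))
        fp = sum(p >= t and y != positive for p, y in zip(probs, labels))
        points.append((fp / neg, tp / pos))
    return points

# A toy classifier that ranks both positives above both negatives.
probs = [0.9, 0.7, 0.4, 0.2]
labels = [">50K", ">50K", "<=50K", "<=50K"]
print(roc_points(probs, labels))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```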

Conclusions

Boosting is the most appropriate technique for this classification problem.

The most influential variables are Marital_status, Education_Num, Benefits…

Particular case

Now the company wants to analyze customers who participate in the financial market, and to recommend different financial products to clients according to their needs.

Data preprocessing

We do not remove outliers because they are relevant to the problem.

We want to analyze only customers who participate in the financial market, that is, clients who have gained or lost capital. We filter the dataset using this restriction.

\[ \text{Benefits} = \text{Capital\_Gain} - \text{Capital\_Loss} \neq 0 \]
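The filtering step can be sketched in pandas (illustrative toy data):

```python
import pandas as pd

df = pd.DataFrame({"Capital_Gain": [0, 3103, 0],
                   "Capital_Loss": [0, 0, 1721]})
df["Benefits"] = df["Capital_Gain"] - df["Capital_Loss"]

# Keep only clients who gained or lost capital (Benefits != 0).
market = df[df["Benefits"] != 0]
print(len(market))  # 2
```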

Model selection

Best model using 5 variables

\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Relationship} + \text{Occupation} + \text{Benefits} \]

Analysis

What are the most influential variables now?

Comparison of methods

END